@hzg0601 hzg0601 commented Aug 8, 2023

  1. Support loading datasets from a local path in offline mode.
    Users who can't access the Hugging Face Hub reliably have to download the repo from the Hub manually and load the dataset in offline mode. However, the current implementation uses the model_name as the key to look up the corresponding dataset class in raw_datasets.py, which raises RuntimeError( f"We do not have configs for dataset {dataset_name}, but you can add it by yourself in raw_datasets.py." ) when given a local path. This PR fixes the problem by splitting the local path string.
  2. Some repo owners on the Hugging Face Hub, such as bigscience, separate the model and the tokenizer, so this PR adds a tokenizer_name_or_path option to args to cover that case.
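
For readers following along, here is a minimal sketch of the two ideas described above. It is not the code from this PR: the helper names `dataset_key_from_path`, `build_parser`, and `resolve_tokenizer_path`, the `KNOWN_DATASETS` set, and the "last two path components" rule are all assumptions made for illustration. Only the error message and the `tokenizer_name_or_path` option name come from the description.

```python
import argparse

# Hypothetical stand-in for the dataset names that raw_datasets.py knows about.
KNOWN_DATASETS = {"Dahoas/rm-static", "Dahoas/full-hh-rlhf"}


def dataset_key_from_path(dataset_name, known=KNOWN_DATASETS):
    """Map a hub name or a local checkout path to a known dataset key.

    Assumption: a local copy keeps the "owner/name" layout as its last two
    path components, e.g. /data/hf/Dahoas/rm-static.
    """
    if dataset_name in known:
        return dataset_name
    # Normalize separators, drop empty pieces (leading or trailing slashes),
    # then rebuild "owner/name" from the last two components.
    parts = [p for p in dataset_name.replace("\\", "/").split("/") if p]
    candidate = "/".join(parts[-2:])
    if candidate in known:
        return candidate
    raise RuntimeError(
        f"We do not have configs for dataset {dataset_name}, "
        "but you can add it by yourself in raw_datasets.py."
    )


def build_parser():
    """Argument parser with a separate, optional tokenizer location."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name_or_path", type=str, required=True)
    # Optional: only needed when the tokenizer lives apart from the model.
    parser.add_argument("--tokenizer_name_or_path", type=str, default=None)
    return parser


def resolve_tokenizer_path(args):
    """Fall back to the model location when no tokenizer path is given."""
    return args.tokenizer_name_or_path or args.model_name_or_path
```

With these helpers, a local path such as `/data/hf/Dahoas/rm-static/` resolves to the same key as the hub name `Dahoas/rm-static`, and omitting `--tokenizer_name_or_path` leaves the behavior unchanged for repos that ship the model and tokenizer together.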

Author

hzg0601 commented Aug 8, 2023

@microsoft-github-policy-service agree

@conglongli
Contributor

@hzg0601 Before reviewing this PR, we need your help on two things:

(1) Why is your PR deleting this log file? https://github.com/microsoft/DeepSpeedExamples/blob/04b1036349b0d02e0f940c8431414ef18e96ddf2/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_log_output/opt-1.3b-globalBatchSize128.log

If this was a mistake, please restore it. If it was intentional, please give a clear reason for deleting it.

(2) The formatting test is failing. Please follow my comment here #597 (comment) to fix it.

@hzg0601 hzg0601 closed this Aug 10, 2023